ABSTRACT


We address the problem of predicting the correctness of 
the student’s response on the next exam question based on 
their previous interactions in the course of their learning 
and evaluation process. We model the student performance 
as a dynamic problem and compare the two major classes 
of dynamic neural architectures for its solution, namely the 
finite-memory Time Delay Neural Networks (TDNN) and 
the potentially infinite-memory Recurrent Neural Networks 
(RNN). Since the next response is a function of the knowl- 
edge state of the student and this, in turn, is a function of 
their previous responses and the skills associated with the 
previous questions, we propose a two-part network architec- 
ture. The first part employs a dynamic neural network (ei- 
ther TDNN or RNN) to trace the student knowledge state. 
The second part applies on top of the dynamic part and it 
is a multi-layer feed-forward network which completes the 
classification task of predicting the student response based 
on our estimate of the student knowledge state. Both input 
skills and previous responses are encoded using different em- 
beddings. Regarding the skill embeddings we tried two dif- 
ferent initialization schemes using (a) random vectors and 
(b) pretrained vectors matching the textual descriptions of 
the skills. Our experiments show that the performance of the 
RNN approach is better compared to the TDNN approach in 
all datasets that we have used. Also, we show that our RNN 
architecture outperforms the state-of-the-art models in four 
out of five datasets. It is worth noting that the TDNN approach
also outperforms the state-of-the-art models in four
out of five datasets, although it is slightly worse than our
proposed RNN approach. Finally, contrary to our expec- 
tations, we find that the initialization of skill embeddings 
using pretrained vectors offers practically no advantage over 
random initialization. 
1. INTRODUCTION


Knowledge is distinguished by the ability to evolve over 
time. This progression of knowledge is usually incremen- 
tal and its formation is related to the cognitive areas being 
studied. The process of Knowledge Tracing (KT), defined as
the task of predicting students’ performance, has attracted
the interest of many researchers in recent decades [4]. The
Knowledge State (KS) of a student is the degree of his or 
her mastering the Knowledge Components (KC) in a certain 
domain, for example “Algebra” or “Physics”. A knowledge 
component generally refers to a learnable entity, such as a 
concept or a skill, that can be used alone or in combination 
with other KCs in order to solve an exercise or a problem 
[9]. Knowledge Tracing is the process of modeling and as- 
sessing a student’s KS in order to predict his or her ability 
to answer the next problem correctly. The estimation of the 
student’s knowledge state is useful for improving the educa- 
tional process by identifying the level of his/her understand- 
ing of the various knowledge components. By exploiting this 
information it is possible to suggest appropriate educational 
material to cover the student’s weaknesses and thus maxi- 
mize the learning outcome. 


The main problem of Knowledge Tracing is the efficient man- 
agement of the responses over time. One of the factors which 
add complexity to the problem of KT is the student-specific 
learning pace. The knowledge acquisition may differ from 
person to person and may also be influenced by already ex- 
isting knowledge. More specifically, KT is predominantly 
considered as a supervised sequence learning problem where 
the goal is to predict the probability that a student will an- 
swer correctly the future exercises, given his or her history 
of interactions with previous tests. Thus, the prediction of 
the correctness of the answer is based on the history of the 
student’s answers in combination with the skill that is cur- 
rently examined at this time instance. 


Mathematically, the KT task is expressed as the probability
$P(r_{t+1} = 1 \mid q_{t+1}, X_t)$ that the student will offer the correct
response in the next interaction $x_{t+1}$, where the student’s
learning activities are represented as a sequence of interactions
$X_t = \{x_1, x_2, x_3, \ldots, x_t\}$ over time $T$. The interaction $x_t$
consists of a tuple $(q_t, r_t)$ which represents the question
$q_t$ being answered at time $t$ and the student response
$r_t$ to the question. Without loss of generality, we shall assume
that knowledge components are represented by skills
from a set $S = \{s_1, s_2, \ldots, s_m\}$. One simplifying assumption,
used by many authors [24], is that every question in the set
$Q = \{q_1, q_2, \ldots, q_T\}$ is related to a unique skill from $S$. Then
the knowledge levels of the student for each one of the skills
in $S$ compose his or her knowledge state.
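
To make the formulation concrete, the following minimal sketch shows one way to turn a single student's interaction sequence into supervised examples of the form (history, next skill, next response). The tuple layout and the function name are illustrative assumptions and are not taken from any released code.

```python
from typing import List, Tuple

# One interaction x_t = (q_t, r_t): a skill id and a response in {0, 1}.
Interaction = Tuple[int, int]


def next_response_examples(interactions: List[Interaction]):
    """Return (history X_t, next skill q_{t+1}, target r_{t+1}) triples."""
    examples = []
    for t in range(1, len(interactions)):
        history = interactions[:t]                  # x_1 ... x_t
        next_skill, next_response = interactions[t]
        examples.append((history, next_skill, next_response))
    return examples


# Toy sequence for one hypothetical student.
student = [(3, 1), (3, 0), (7, 1), (3, 1)]
for history, q_next, r_next in next_response_examples(student):
    print(len(history), q_next, r_next)
```

Each triple corresponds to one evaluation of the probability above: the model sees the history and the next skill, and is scored against the next response.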


The dynamic nature of Knowledge Tracing leads to approa- 
ches that have the ability to model time-series or sequential 
data. In this work we propose two dynamic machine learning 
models that are implemented by time-dependent methods, 
specifically recurrent and time delay neural networks. Our 
models outperform the current state-of-the-art approaches 
in four out of five benchmark datasets that we have studied. 
The proposed models differ from the existing ones in two 
main architectural aspects: 


• we find that attention does not help improve the performance
  and therefore we make no use of attention layers

• we experiment with and compare two different skill
  embedding types: (a) initialized by pretrained embeddings
  of the textual descriptions of the skill names using
  standard methods such as Word2Vec and FastText and
  (b) randomly initialized embeddings based on skill ids


The rest of the paper is organized as follows. Section 2 reviews
the related work on KT and the existing models for
student performance prediction. In Section 3 we present our
proposed models and describe their architecture and characteristics.
The datasets we prepared and used are presented
in Section 4, while the experimental setup and the results
are explained in Section 5. Finally, Section 6 concludes this
work and discusses future work and extensions of the
research.


2. RELATED WORKS 


The problem of knowledge tracing is dynamic as student 
knowledge is constantly changing over time. Thus, a variety 
of methods, highly structured or dynamic, have been pro- 
posed to predict students’ performance. One of the earlier 
methods is Bayesian Knowledge Tracing (BKT) [4] which 
models the problem as a Hidden Markov chain in order to 
predict the sequence of outcomes for a given learner. The
Performance Factors Analysis model (PFA) [14] was proposed to
tackle the knowledge tracing task by modifying the Learning
Factor Analysis model. It estimates the probability that a
student will answer a question correctly by maximizing the 
likelihood of a logistic regression model. The features used 
in the PFA model, although interpretable, are relatively sim- 
ple and designed by hand, and may not adequately represent 
the students’ knowledge state [23]. 


Deep Knowledge Tracing (DKT) [15] is the first dynamic 
model proposed in the literature utilizing recurrent neural 
networks (RNN) and specifically the Long Short-Term Mem- 
ory (LSTM) model [6] to track student knowledge. It uses 
one-hot encoded skill tags and associated responses as inputs
and it trains the neural network to predict the next student
response. The hidden state of the LSTM can be considered 
as the latent knowledge state of a student and can carry the 
information of the past interactions to the output layer. The 
output layer of the model computes the probability of the 
student answering correctly a question relating to a specific 
Knowledge Component. 


Another approach for predicting student performance is the 
Dynamic Key-Value Memory Network (DKVMN) [24] which 
relies on an extension of memory networks proposed in [12]. 
The model tries to capture the relationship between differ- 
ent concepts. The DKVMN model outperforms DKT us- 
ing memory slots as key and value components to encode 
the knowledge state of students. Learning or forgetting of 
a particular skill are stored in those components and con- 
trolled by read and write operations through the Least Re- 
cently Used Access (LRUA) attention mechanism [16]. The 
key component is responsible for storing the concepts and is 
fixed during testing while the value component is updated 
when a concept state changes. The latter means that when 
a student acquires a concept in a test the value component 
is updated based on the correlation between exercises and 
the corresponding concept. 


The Deep-IRT model [23] is the newest approach that ex- 
tends the DKVMN model. The author combined the capa- 
bilities of DKVMN with the Item Response Theory (IRT) 
[5] in order to measure both student ability and question dif- 
ficulty. At the same time, another model, named Sequential 
Key-Value Memory Networks (SKVMN) [1], tried to over- 
come the problem of DKVMN to capture long term depen- 
dencies in the sequences of exercises and generally in sequen- 
tial data. This model combines the DKVMN mechanism 
with the Hop-LSTM, a variation of LSTM architecture and 
has the ability to discover sequential dependencies among 
exercises, but it skips some LSTM cells to approach previ- 
ous concepts that are considered relevant. Finally, another 
newly proposed model is Self Attentive Knowledge Tracing 
(SAKT) [13]. SAKT utilizes a self-attention mechanism and 
mainly consists of three layers: an embedding layer for in- 
teractions and questions followed by a Multi-Head Attention 
layer [19] and a feed-forward layer for student response pre- 
diction. 


The above models either use simple features (e.g. PFA) 
or they use machine learning approaches such as key-value 
memory networks or attention mechanisms that may add 
significant complexity. However, we will show that similar
and often, in fact, better performance can be achieved by 
simpler dynamic models combining embeddings and recur- 
rent and/or time-delay feed-forward networks as proposed 
next. 


3. PROPOSED APPROACH 
3.1 Dynamic Models 


As noted in the relevant literature, knowledge change
over time is often modeled by dynamic neural networks.
Dynamic models produce their output based on a time window,
called the “context window”, that contains the recent history of
inputs and/or outputs.


There are two types of dynamic neural networks (Figure 1): 
(a) Time-Delay Neural Networks (TDNN), with only feed-forward
connections and finite-memory of length L equal to
the length of the context window, and (b) Recurrent Neu- 
ral Networks (RNN) with feed-back connections that can 
have potentially infinite-memory although, practically, their 
memory length is dictated by a forgetting factor parameter. 
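
The difference between finite and potentially infinite memory can be illustrated with two toy linear filters. This is only a sketch under simplifying assumptions: the functions below are scalar, linear stand-ins for a TDNN and an RNN (not neural networks), and the weights and the `forgetting` parameter are hypothetical values chosen for illustration.

```python
def tdnn_like(inputs, weights):
    """Finite memory: output at t is a weighted sum of the last len(weights) inputs."""
    L = len(weights)
    outputs = []
    for t in range(len(inputs)):
        # Context window x_t, x_{t-1}, ..., x_{t-L+1}; missing history counts as 0.
        window = [inputs[t - k] if t - k >= 0 else 0.0 for k in range(L)]
        outputs.append(sum(w * x for w, x in zip(weights, window)))
    return outputs


def rnn_like(inputs, forgetting=0.5):
    """Feedback: the previous state is fed back, so old inputs decay gradually."""
    state, outputs = 0.0, []
    for x in inputs:
        state = forgetting * state + x
        outputs.append(state)
    return outputs


impulse = [1.0] + [0.0] * 5
print(tdnn_like(impulse, [1.0, 0.5, 0.25]))  # response is exactly zero after 3 steps
print(rnn_like(impulse, forgetting=0.5))     # response keeps halving, never exactly zero
```

Feeding a single impulse shows the contrast: the feed-forward filter forgets the input completely once it leaves the window, while the recurrent one retains a trace whose size is governed by the forgetting factor.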


Figure 1: Dynamic model architectures: (a) Time-Delay Neural
Network (feed-forward connections only) (b) Recurrent Neural
Network (feed-forward connections with feed-back).


3.2 The Proposed Models 


We approach the task of predicting the student response 
(0=wrong, 1=correct) on a question involving a specific skill 
as a dynamic binary classification problem. In general, we 
view the response $r_t$ as a function of the previous student
interactions:

$$r_t = h(q_t, q_{t-1}, q_{t-2}, \ldots, r_{t-1}, r_{t-2}, \ldots) + \epsilon_t \qquad (1)$$

where $q_t$ is the skill tested at time $t$ and $\epsilon_t$ is the prediction
error. The response is therefore a function of the current and
the previous tested skills $\{q_t, q_{t-1}, q_{t-2}, \ldots\}$, as well as the
previous responses $\{r_{t-1}, r_{t-2}, \ldots\}$ given by the student.


We implement h as a dynamic neural model. Our proposed 
general architecture is shown in Figure 2. The inputs are 
the skill and response sequences $\{q\}$, $\{r\}$ collected during
a time-window of length $L$ prior to time $t$. Note that the
skill sequence includes the current skill $q_t$ but the response
sequence does not contain the current response which is ac- 
tually what we want to predict. The architecture consists of 
two main parts: 


• The Encoding sub-network. It is used to represent
  the response and skill input data using different embeddings.
  Clearly, embeddings are useful for encoding
  skills since skill ids are categorical variables. We found
  that using embeddings to encode responses is also very
  beneficial. The details of the embeddings initialization
  and usage are described in the next section.

• The Tracing sub-network. This first estimates the
  knowledge state of the student and then uses it to predict
  his/her response. Our model function consists of
  two parts: (i) the Knowledge-Tracing part, represented
  by the dynamic model $f$, which predicts the student
  knowledge state $v_t$ and (ii) the classification part $g$,
  which predicts the student response based on the estimated
  knowledge state:

  $$v_t = f(q_t, q_{t-1}, q_{t-2}, \ldots, r_{t-1}, r_{t-2}, \ldots) \qquad (2)$$
  $$\hat{r}_t = g(v_t) \qquad (3)$$


Depending on the memory length, we obtain two categories
of models:

(a) models based on RNN networks which can potentially
    have infinite memory. In this case the KT
    model is recurrent:

    $$v_t = f(v_{t-1}, q_t, q_{t-1}, \ldots, q_{t-L}, r_{t-1}, \ldots, r_{t-L})$$

(b) models based on TDNN networks which have finite
    memory of length $L$. In this case the KT
    model has finite impulse response $L$:

    $$v_t = f(q_t, q_{t-1}, \ldots, q_{t-L}, r_{t-1}, \ldots, r_{t-L})$$


Although RNNs have been used in the relevant literature, it 
is noteworthy that TDNN approaches have not been investi- 
gated in the context of knowledge tracing. The classification 
part is modeled by a fully-connected feed-forward network 
with a single output unit. 
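
As an illustration of how the inputs of the dynamic model might be assembled, the sketch below builds the window of skills (including the current one) and the window of previous responses for predicting the response at time t. The padding id `PAD`, the shift of responses to 1/2 so that 0 can mark padding, and the function name are assumptions made for this example only.

```python
PAD = 0  # hypothetical padding id; real skill ids are assumed to start at 1


def make_window(skills, responses, t, L):
    """Inputs for predicting r_t: skills q_{t-L}..q_t and responses r_{t-L}..r_{t-1}."""
    start = t - L
    # The skill window includes the current skill q_t.
    skill_win = [skills[i] if i >= 0 else PAD for i in range(start, t + 1)]
    # The response window stops at r_{t-1}; r_t is the prediction target.
    # Responses are shifted to 1/2 so that 0 can mark padding (an assumption).
    resp_win = [responses[i] + 1 if i >= 0 else PAD for i in range(start, t)]
    return skill_win, resp_win, responses[t]


skills = [5, 5, 9, 2]
responses = [1, 0, 1, 1]
print(make_window(skills, responses, t=3, L=2))  # ([5, 9, 2], [1, 2], 1)
print(make_window(skills, responses, t=1, L=2))  # ([0, 5, 5], [0, 2], 0)
```

The current response never appears among the inputs, which mirrors the requirement that it is the quantity being predicted.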


[Figure 2 blocks: Encoding (skills), Encoding (responses), Tracing Sub-net, Classification]

Figure 2: General proposed architecture. The dynamic
model can be either a Recurrent Neural Network
(with a feedback connection from the output
of the dynamic part into the model input) or a Time
Delay Neural Network (without feedback connection).


We investigated two different architectures: one based on 
recurrent neural networks and another based on time delay 
neural networks. The details of each proposed model archi- 
tecture are described below. 


3.3 Encoding Sub-network

The first part in all our proposed models consists of two
parallel embedding layers with dimensions $d_q$ and $d_r$, respectively,
which encode the tested skills and the responses
given by the student. During model training the weights of
the Embedding layers are updated. The response embedding
vectors are initialized randomly. The skill embedding
vectors, on the other hand, are initialized either randomly
or using pretrained data. In the latter case we use pretrained
vectors corresponding to the skill names obtained
from the Word2Vec [11] or FastText [7] methods.
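
The two initialization options can be sketched with a plain lookup table. This is a minimal illustration, assuming a uniform random initializer in a small range and a dictionary of pretrained vectors keyed by skill id; neither detail is specified in the text, and the function name is hypothetical.

```python
import random


def init_embeddings(num_items, dim, pretrained=None, seed=0):
    """Return a num_items x dim table; rows listed in `pretrained` are copied, others are random."""
    rng = random.Random(seed)
    table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_items)]
    for item_id, vector in (pretrained or {}).items():
        if len(vector) != dim:
            raise ValueError("pretrained vector has the wrong dimension")
        table[item_id] = list(vector)
    return table


# Skill table: skill 2 starts from a (toy) pretrained vector, the rest are random.
skill_table = init_embeddings(num_items=4, dim=3, pretrained={2: [0.1, 0.2, 0.3]})
# Response table: two possible responses, always randomly initialized.
response_table = init_embeddings(num_items=2, dim=3)
# Embedding lookup for a short skill sequence.
encoded = [skill_table[q] for q in [2, 0, 2]]
```

In either case the table is only a starting point, since the embedding weights are updated during training.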


A 1D spatial dropout layer [18] is added after each Embedding
layer. Spatial dropout was added because overfitting was
observed during the first epochs on each validation set.
We postulated that correlations among the skill name
embeddings, which might not actually exist, confused the model.
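
For readers unfamiliar with spatial dropout, the sketch below shows its defining property on a plain list-of-lists sequence: whole embedding channels are dropped for every time step at once, rather than individual values. The rescaling of kept channels by 1/(1 - rate) follows the usual inverted-dropout convention and is an assumption here, as is the function name.

```python
import random


def spatial_dropout_1d(sequence, rate, rng):
    """Drop entire channels of a (time steps x channels) sequence; requires rate < 1."""
    channels = len(sequence[0])
    # One keep/drop decision per channel, shared by all time steps.
    keep = [rng.random() >= rate for _ in range(channels)]
    scale = 1.0 / (1.0 - rate)
    return [[v * scale if keep[c] else 0.0 for c, v in enumerate(step)]
            for step in sequence]


rng = random.Random(1)
sequence = [[1.0] * 8 for _ in range(5)]   # 5 time steps, 8 embedding channels
dropped = spatial_dropout_1d(sequence, rate=0.5, rng=rng)
for step in dropped:
    print(step)
```

Every column of the printed matrix is either all zeros or all rescaled values, which is what distinguishes this layer from element-wise dropout.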


3.4 Tracing Sub-network 

We experimented with two types of main dynamic sub-net- 
works, namely Recurrent Neural Networks and Time Delay 
Neural Networks. These two approaches are described next. 


3.4.1 RNN Approach: Bi-GRU Model 
The model architecture based on the RNN method for the 
knowledge tracing task is shown in Figure 3. 


[Figure 3 diagram: questions $q_t, \ldots, q_{t-L}$ → Skill Embeddings → Spatial Dropout → Convolutional; responses $r_{t-1}, \ldots, r_{t-L}$ → Response Embeddings → Spatial Dropout → Convolutional; both branches → Bidirectional-GRU → Gaussian Dropout → $v_t$ → Dense → output $\hat{r}_t$]

Figure 3: Bi-GRU model


The Spatial Dropout rate following the input embedding
layers is 0.2 for most of the datasets used. Next, we feed the
skill and response input branches into a Convolutional
layer consisting of 100 filters, with kernel size 3, stride 1,
and ReLU activation function. The Convolutional layer acts
as a projection mechanism that reduces the input dimensions
from the previous Embedding layer. This is found to
help alleviate the overfitting problem. To the best of our
knowledge, Convolutional layers have not been used in previously
proposed neural models for this task. The two input
branches are then concatenated to feed a Bidirectional
Gated Recurrent Unit (GRU) layer with 64 units [3]. Batch
normalization and ReLU activation layers are applied between
the convolutional and concatenation layers. This structure
resulted from extensive experiments with other popular
recurrent models such as LSTM, plain GRU and also
bi-directional versions of those models, in which we found
the proposed architecture to be the most efficient one.


On top of the RNN layer we append a fully connected sub-network
consisting of three dense layers with 50, 25 and 1 (output)
units, respectively. The first two dense layers
have a ReLU activation function while the last one has a sigmoid
activation which is used to make the final prediction
($0 < \hat{r}_t < 1$).


3.4.2 TDNN Approach


In our TDNN model (Figure 4) we add a Convolutional layer 
after each embedding layer with 50 filters and kernel size 
equal to 5. 


[Figure 4 diagram: questions $q_t, \ldots, q_{t-L}$ → Skill Embeddings → Spatial Dropout → Convolutional; responses $r_{t-1}, \ldots, r_{t-L}$ → Response Embeddings → Spatial Dropout → Convolutional; both branches → Gaussian Dropout → $v_t$ → output $\hat{r}_t$]

Figure 4: TDNN model


Batch normalization is used before the ReLU activation is 
applied. As with the RNN model, the two input branches 
are concatenated to feed the classification sub-network. It 
consists of four dense layers with 20, 15, 10, and 5 units 
respectively, using the ReLU activation function. This fun- 
nel schema of hidden layers (starting with wider layers and 
continuing with narrower ones) has helped achieve better 
results for all datasets we have experimented with. At the
beginning of the classification sub-network we insert a Gaussian
Dropout layer [17] which multiplies neuron activations
with a Gaussian random variable of mean value 1. This has
been shown to work as well as the classical Bernoulli noise
dropout and in our case even better.
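
A minimal sketch of Gaussian dropout is given below. It assumes the common parameterization in which the multiplicative noise has mean 1 and standard deviation sqrt(rate / (1 - rate)); the text above only states the mean, so the variance choice and the function name are assumptions of this example.

```python
import math
import random


def gaussian_dropout(activations, rate, rng, training=True):
    """Multiply each activation by Gaussian noise with mean 1 (identity at test time)."""
    if not training:
        return list(activations)
    stddev = math.sqrt(rate / (1.0 - rate))  # assumed noise level for a given rate
    return [a * rng.gauss(1.0, stddev) for a in activations]


rng = random.Random(0)
noisy = gaussian_dropout([1.0] * 20000, rate=0.2, rng=rng)
print(sum(noisy) / len(noisy))  # close to 1.0, since the noise has mean 1
```

Because the noise has mean 1, the expected activation is unchanged and no rescaling is needed at test time.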


4. DATASETS 


We tested our models using four popular datasets from the
ASSISTments online tutoring platform. Three of them,
“ASSISTment09”, “ASSISTment09 corrected”, and “ASSISTment12”²,
were provided by the above platform. The fourth
dataset, named “ASSISTment17”, was obtained from the 2017
Data Mining competition page³. Finally, a fifth dataset,
“FSAI-F1toF3”, provided by “Find Solution Ai Limited”, was
also used in our experiments. It is collected using data from
the 4LittleTrees⁴ adaptive learning application.

Table 1: Datasets Overview.

Dataset                | Skills | Students | Responses | Baseline Accuracy
ASSISTment09           | 110    | 4,151    | 325,637   | 65.84%
ASSISTment09 corrected | 101    | 4,151    | 274,590   | 66.31%
ASSISTment12           | 196    | 28,834   | 2,036,080 | 69.65%
ASSISTment17           | 101    | 1,709    | 864,713   | 62.67%
FSAI-F1toF3            | 99     | 310      | 51,283    | 52.98%


4.1 Datasets Descriptions 

The ASSISTments datasets contain data from student tests
on mathematical problems [2], organized in columns.
Each line records one student interaction, and there are one
or more interactions recorded for each student. We take into
account the information concerning the responses of students
to questions related to a skill.
Thus, we use the following columns: “user_id”, “skill_id”,
“skill_name”, and “correct”. The “skill_name” column contains a
verbal description of the skill tested. The “correct” column
contains the values of the students’ responses, which are either
1 (for correct) or 0 (for wrong).


The original “ASSISTment09” dataset contains 525,534 student
responses. It has been used extensively for the KT task
by several researchers but, according to [2], data quality
issues have been detected concerning duplicate rows. In
our work we used the “preprocessed ASSISTment09” dataset
found in the DKVMN⁵ and Deep-IRT⁶ GitHub repositories. In this
dataset the duplicate rows and the empty field values were
cleaned, so that finally 4,151 unique students participate
with 325,637 total responses and 110 unique skills.


Even after this cleaning there are still some problems, such as
duplicate skill ids for the same skill name. These problems
have been corrected in the “ASSISTment09 corrected” dataset.
This dataset contains 346,860 student interactions and has
been recently used in [21].


The “ASSISTment12” dataset contains students’ data until
the school year 2012-2013. The initial dataset contains
6,123,270 responses and 198 skills. Some of the skills have
the same skill name but different skill ids. The total number
of skill ids is 265. The “ASSISTment17” dataset contains
942,816 student responses and 101 skills.


¹https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data/skill-builder-data-2009-2010
²https://sites.google.com/site/assistmentsdata/home/2012-13-school-data-with-affect
³https://sites.google.com/view/assistmentsdatamining/data-mining-competition-2017
⁴https://www.4littletrees.com
⁵https://github.com/jennyzhang0215/DKVMN
⁶https://github.com/ckyeungac/DeepIRT


Finally, the “FSAI-F1toF3” dataset is the smallest dataset
we used. It involves responses to mathematical problems
from 7th grade to 9th grade Hong Kong students and consists
of 51,283 student responses from 310 students on 99
skills and 2,266 questions. As is commonly the case in
most studies using this dataset, we have used the question
tag as the model input $q_t$.


4.2 Data Preprocessing 

No preprocessing was performed on the “ASSISTment09”
and “FSAI-F1toF3” datasets. For the remaining datasets
we followed three preparation steps.


First, the skill ids were repaired by replacement. In particular,
the “ASSISTment09 corrected” dataset contained
skills of the form “skill1_skill2” and “skill1_skill2_skill3”
which correspond to the same skill names, so we merged
them into the first skill id, found before the underscore. In
other words, the skill “70_138” was replaced with skill “10”
and so on. Moreover, a few misspellings were observed and
corrected, and the punctuation found in three skill
names was converted to the corresponding words. For example,
in the skill name “Parts of a Polnomial Terms Coefficient
Monomial Exponent Variable” we replaced “Polnomial”
with “Polynomial”. Also, in the skill name “Order
of Operations +,-,/,*() positive reals” we replaced the
symbols “+,-,/,*()” with the words that express these symbols,
i.e. “addition subtraction division multiplication parentheses”.
This preprocessing action was preferred over
the removal of punctuation since the datasets refer to
mathematical methods and operations and without the symbols
we would lose the meaning of each skill. A similar procedure
was followed for the “ASSISTment12” dataset. Furthermore,
spaces after some skill names were removed, e.g.
the skill name “Pattern Finding ” became “Pattern Finding”.
In the “ASSISTment17” dataset we came across skill
names such as “application: multi-column subtraction” and
corrected them by replacing the punctuation marks with spaces,
giving “application multi column subtraction”. These text
preparation operations were made to ease the generation of word
embeddings for the skill name descriptions. In addition, in the
“ASSISTment17” dataset, problem ids are used instead of
skill ids. We had to match and replace the problem ids with
the corresponding skill ids so that all datasets
follow a uniform format.
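
The cleaning rules described above can be sketched as a few small functions. This is a minimal illustration rather than the actual preprocessing script: the function names, the symbol and misspelling dictionaries, and the rule that a token without letters or digits is treated as a run of operator symbols are all assumptions made for this example.

```python
import re

# Hypothetical lookup tables built from the examples given in the text.
SYMBOL_WORDS = {"+": "addition", "-": "subtraction", "/": "division",
                "*": "multiplication", "()": "parentheses"}
MISSPELLINGS = {"Polnomial": "Polynomial"}


def merge_skill_id(skill_id):
    """Keep only the first id of a combined id such as 'a_b' or 'a_b_c'."""
    return skill_id.split("_")[0]


def expand_symbols(token):
    """Turn a symbols-only token such as '+,-,/,*()' into words."""
    words, i = [], 0
    while i < len(token):
        if token.startswith("()", i):
            words.append(SYMBOL_WORDS["()"])
            i += 2
        else:
            if token[i] in SYMBOL_WORDS:
                words.append(SYMBOL_WORDS[token[i]])
            i += 1
    return " ".join(words)


def clean_skill_name(name):
    """Replace punctuation, expand operator symbols, trim spaces, fix known misspellings."""
    parts = []
    for token in name.split():
        if any(ch.isalnum() for ch in token):
            parts.append(re.sub(r"[^\w\s]", " ", token))  # punctuation -> space
        else:
            parts.append(expand_symbols(token))           # operators -> words
    words = " ".join(parts).split()                       # also trims extra spaces
    return " ".join(MISSPELLINGS.get(w, w) for w in words)


print(clean_skill_name("Order of Operations +,-,/,*() positive reals"))
print(clean_skill_name("application: multi-column subtraction"))
print(clean_skill_name("Pattern Finding "))
```

Treating symbol-only tokens separately is what lets the hyphen in “multi-column” become a space while the standalone “-” in the operator list becomes the word “subtraction”.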


Secondly, all rows containing missing values were discarded.
After this preprocessing, the statistics of the datasets
are as shown in Table 1.


Finally, we split the datasets so that 70% was used for training
and 30% for testing. Then, the training subset was further
split into five train-validation subsets using 80% for
training and 20% for validation.
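
One plausible reading of this splitting scheme is a 70/30 hold-out split followed by five folds over the training part, each fold using 80% for training and 20% for validation. The sketch below implements that reading; whether the split is made over students or over individual rows, and whether the data are shuffled first, is not stated in the text, so both are assumptions, as is the function name.

```python
import random


def split_dataset(items, seed=0, folds=5):
    """70/30 train-test split, then `folds` train-validation pairs over the training part."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)                      # shuffling is an assumption
    cut = len(items) * 7 // 10              # 70% for training
    train, test = items[:cut], items[cut:]
    pairs = []
    for i in range(folds):
        # Every `folds`-th item goes to validation (20% when folds == 5).
        val = [x for j, x in enumerate(train) if j % folds == i]
        trn = [x for j, x in enumerate(train) if j % folds != i]
        pairs.append((trn, val))
    return train, test, pairs


# Items could be student ids (an assumption); here simply 0..99.
train, test, pairs = split_dataset(range(100))
print(len(train), len(test), [(len(a), len(b)) for a, b in pairs])
```

With 100 items this yields 70 training and 30 test items, and five pairs of 56 training and 14 validation items each.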


5. EXPERIMENTS 


In this section we experimentally validate the effectiveness
of the proposed methods by comparing them with each other
and also with other state-of-the-art performance prediction
models. The Area Under the ROC Curve (AUC) [10] metric
is used to compare the predicted probabilities of a correct
student response.
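
As a reminder of what the AUC measures, the sketch below computes it directly from its probabilistic definition: the fraction of (correct, wrong) response pairs in which the correct response received the higher predicted probability, with ties counted as one half. This quadratic-time version is for illustration only and is not the implementation used in the experiments.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive-negative pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of correct from wrong responses.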


The state-of-the-art knowledge tracing models we compare
against are DKT, DKVMN and Deep-IRT. We ran
the experiments for our proposed Bi-GRU and TDNN models
as well as for each of the previous models on all datasets,
using the code provided by the authors on GitHub. It
is worth noting that the Python GitHub code⁷ used for the
DKT model experiments requires the entire dataset file and
performs the train/test split during code execution.


All the experiments were performed on a workstation with 
Ubuntu operating system, Intel i5 CPU and 16GB Titan Xp 
GPU card. 


5.1 Skill embeddings initialization 

As mentioned earlier, skill embeddings are initialized either
randomly or using pretrained vectors. For the initialization
with pretrained vectors we used two methods, described next.
In the first method we used the text files from Wikipedia2Vec⁸ [22],
which is based on the Word2Vec method and provides pretrained
embeddings for English word representation vectors in
100 and 300 dimensions. In the second method we used the
“SISTER” (SImple SenTence EmbeddeR)⁹ library to prepare the
skill name embeddings based on FastText pretrained word
embeddings in 300 dimensions. Each skill name consists of one
or more words. Thus, for the Word2Vec method, the skill
name embedding vector is created by adding the word embedding
vectors, while in the case of FastText, the skill name
embedding is created by taking the average of the word
embeddings.
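
The two ways of combining word vectors into a skill name vector can be sketched as follows. The toy two-dimensional word vectors, the lower-casing of names, and the zero-vector fallback for names with no known words are assumptions of this example; the actual pretrained vectors have 100 or 300 dimensions.

```python
def skill_name_vector(name, word_vectors, mode):
    """Combine the word vectors of a skill name by 'sum' or by 'mean'."""
    dim = len(next(iter(word_vectors.values())))
    vectors = [word_vectors[w] for w in name.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim                      # fallback for unknown names (assumption)
    total = [sum(column) for column in zip(*vectors)]
    if mode == "sum":
        return total
    if mode == "mean":
        return [v / len(vectors) for v in total]
    raise ValueError("mode must be 'sum' or 'mean'")


# Toy word vectors standing in for pretrained embeddings.
toy_vectors = {"pattern": [1.0, 0.0], "finding": [0.0, 2.0]}
print(skill_name_vector("Pattern Finding", toy_vectors, "sum"))   # [1.0, 2.0]
print(skill_name_vector("Pattern Finding", toy_vectors, "mean"))  # [0.5, 1.0]
```

The two variants differ only by a scale factor equal to the number of words, so they point in the same direction in embedding space.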


Specifically for the FSAI-F1toF3 dataset, the question embeddings
are initialized either randomly or using the pretrained
word representations of the corresponding skill descriptions
by employing the Wikipedia2Vec and SISTER methods as
described above. Since many questions belong to the same
skill, in this case the corresponding rows in the embedding
matrix are initialized by the same vector.


5.2 Experimental Settings 

We performed cross-validation on the 5 training and
validation set pairs in order to choose the best architecture
and parameter settings for each of the proposed
models. Using the train and test sets we then evaluated the
chosen architectures on all the datasets.


⁷https://github.com/lccasagrande/Deep-Knowledge-Tracing
⁸https://wikipedia2vec.github.io/wikipedia2vec/
⁹https://pypi.org/project/sister/


One of the basic hyperparameters of our models that affects
the inputs is $L$, the length of the student’s interaction
history window. The inputs consist of a sequence of $L$ questions
and a sequence of $L - 1$ responses. The best results
were achieved with $L = 50$ for both the Bi-GRU and
TDNN models. The batch sizes used during
training are 32 for Bi-GRU and 50 for TDNN.


Since the pretrained word embeddings come in specific dimensions,
we used the same dimensions for the random
embeddings in order to obtain comparable results. Skill
embeddings and response embeddings are set to the same
dimensions.


A learning rate scheduler is used in Bi-GRU, starting
from 0.001 and decreasing during training,
which runs for 30 epochs. During training we
applied the following learning rate schedule depending on
the epoch number $n$:


$$r_n = \begin{cases} r_{init} & \text{if } n < 10 \\ r_{init} \times e^{0.1 (10 - n)} & \text{otherwise} \end{cases}$$


In the case of the TDNN-based model, the learning rate equals
0.001 and stays the same during the whole training process of
30 epochs. We used the cross-entropy optimization criterion and
the Adam or AdaMax [8] learning algorithms.


Dropout with rate 0.2 or 0.9 is also applied to the Bi-GRU
model, while the TDNN uses one of the rates
(0.2, 0.4, 0.6, 0.9) in its Gaussian dropout
layer. We observed a reduction of overfitting during model
training when the Gaussian dropout rate was adjusted to
the dataset size: the smaller the dataset, the
larger the dropout rate used.


The parameter settings applied during the experimental
process for all proposed models are presented in Table 2.


5.3 Experimental Results

The experimental results of our models are shown in Table 3.
Comparing our models with each other we can see that the
RNN-based Bi-GRU model outperforms the TDNN-based
model in all datasets. It achieved its best results when 100d
embeddings were used, with either pretrained or random
initialization.


We observed that, for both Bi-GRU and TDNN, the embedding
type is not a significant parameter affecting
model performance. Likewise, the differences between the
experimental results show that the size of the embedding
dimensions did not particularly contribute to the final result;
the difference in performance of the models was small.


Besides our own models, we ran experiments on all
datasets for the previous models we compare against. For three
of the datasets, specifically “ASSISTment09 corrected”,
“ASSISTment12” and “ASSISTment17”, no results were
available in the corresponding papers. In this paper, we
present the results of the experiments we ran using the
code of those models.
Table 2: Models experiments settings

Parameters                | Bi-GRU                | TDNN
Learning rate             | 0.001                 | 0.001
Learning rate schedule    | yes                   | no
Training epochs           | 30                    | 30
Batch size                | 32                    | 50
Optimizer                 | Adam                  | AdaMax
History window length     | 50                    | 50
Skill embeddings dim.     | 100 & 300             | 100 & 300
Skill embeddings type     | Random, W2V, FastText | Random, W2V, FastText
Responses embeddings dim. | Same as skill dim.    | Same as skill dim.
Responses embeddings type | Random                | Random


Table 3: Comparison between our proposed models - AUC (%). (R) = random skill embedding initialization,
(W) = skill embedding initialization using W2V, (F) = skill embedding initialization using FastText. Datasets:
(a) ASSISTment09, (b) ASSISTment09 corrected, (c) ASSISTment12, (d) ASSISTment17, (e) FSAI-F1toF3

Dataset | Model  | d_q = 100 (R) | d_q = 300 (R) | d_q = 100 (W) | d_q = 300 (W) | d_q = 300 (F)
(a)     | Bi-GRU | 82.55         | 82.45         | 82.52         | 82.55         | 82.39
(a)     | TDNN   | 81.54         | 81.67         | 81.59         | 81.50         | 81.53
(b)     | Bi-GRU |               |               |               |               |
(b)     | TDNN   |               |               |               |               |
(c)     | Bi-GRU |               |               |               |               |
(c)     | TDNN   |               |               |               |               |
(d)     | Bi-GRU | 73.62         | 73.58         | 73.76         | 73.54         | 73.58
(d)     | TDNN   | 71.68         | 71.75         | 71.52         | 71.81         | 71.33
(e)     | Bi-GRU | 70.47         | 69.34         | 70.24         | 69.80         | 69.51
(e)     | TDNN   | 70.03         | 69.80         | 69.80         | 70.11         | 70.06


The best experimental results of our models in comparison
with the previous models for each dataset are presented
in Table 4. The model with the best performance
on four of the datasets is the Bi-GRU. In addition, the
TDNN-based model also performs better than
the previous models on four datasets. The only dataset
on which the previous models outperformed our models is
“ASSISTment12”.


5.4 Discussion 

Our model architecture is loosely based on the DKT model 
and offers improvements in the aspects discussed below. First, 
we employ embeddings for representing both skills and re- 
sponses. It is known that embeddings offer more useful rep- 
resentations compared to one-hot encoding because they can 
capture the similarity between the items they represent [20]. 
Second, we thoroughly examined dynamic neural models 
for estimating the student knowledge state by trying both 
infinite-memory RNNs and finite-memory TDNNs. To our 
knowledge, TDNNs have not been well studied in the 
literature with respect to this problem. Third, we used 
convolutional layers in the input-encoding sub-net. We found 
that these layers acted as a mechanism for reducing the 
embedding dimensions and, in conjunction with the dropout 
layer, mitigated the overfitting problem. The use of 
convolutional layers is a novelty in models tackling the 
knowledge tracing problem. Fourth, unlike DKT, we used more 
hidden layers in the classification sub-net. Our experiments 
demonstrate that this gives the classifier more 
discriminating capability and improves the results. Finally, 
our experiments with key-value modules and attention 
mechanisms did not further improve our results, so they are 
not reported here. In the majority of the datasets we 
examined, our model outperforms the state-of-the-art models 
employing key-value mechanisms, such as DKVMN and Deep-IRT. 
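
The dimension-reducing role of the convolutional layer can be illustrated with a small sketch: a 1D convolution over the time axis maps each d_in-dimensional embedding vector to d_out < d_in features. The plain-Python function conv1d below, its kernel size, the number of filters and the ReLU activation are illustrative assumptions and do not reproduce the exact layer configuration of our models.

```python
from typing import List

Matrix = List[List[float]]


def conv1d(sequence: Matrix, kernels: List[Matrix],
           bias: List[float]) -> Matrix:
    """'Valid' 1D convolution over time followed by ReLU.

    sequence: T rows, each an embedding vector of size d_in.
    kernels:  d_out filters, each of shape k x d_in.
    bias:     one bias value per filter.
    Returns (T - k + 1) rows of size d_out.
    """
    k = len(kernels[0])
    out = []
    for t in range(len(sequence) - k + 1):
        row = []
        for f, kernel in enumerate(kernels):
            acc = bias[f]
            for j in range(k):
                for c, w in enumerate(kernel[j]):
                    acc += w * sequence[t + j][c]
            row.append(max(0.0, acc))  # ReLU (assumed activation)
        out.append(row)
    return out
```

For example, with 100-dimensional embeddings and 50 filters, every time step is represented by 50 features instead of 100, which reduces the number of inputs seen by the dynamic part of the network.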


Table 4: Comparison of test results - the AUC metric (%) 

Dataset                  DKT      DKVMN    Deep-IRT   Bi-GRU       TDNN 
ASSISTment09             81.56%   81.61%   81.65%     82.55%       81.67% (3) 
ASSISTment09 corrected   74.27%   74.06%   73.41%     75.27%       74.40% 
ASSISTment12             69.40%   69.26%   69.73%     68.40%       67.99% 
ASSISTment17             66.85%   70.25%   70.54%     73.76% (2)   71.83% 
FSAI-F1toF3              69.42%   68.40%   68.69%     70.47% (1)   70.11% (4) 

(1) d_q = d_r = 100, Random 
(2) d_q = d_r = 100, W2V 
(3) d_q = d_r = 300, Random 
(4) d_q = d_r = 300, W2V 
(5) d_q = d_r = 300, FastText 


Table 5: Statistical significance testing results of Bi-GRU and TDNN 

Dataset                  P-value 
ASSISTment09             7.34e-59 
ASSISTment09 corrected   2.31e-52 
ASSISTment12             1.45e-203 
ASSISTment17             7.96e-44 
FSAI-F1toF3              1.38e-84 


In addition to the AUC metric, which is typically used for 
evaluating the performance of our machine learning models, 
we applied statistical significance testing to check the 
similarity between our Bi-GRU and TDNN models. Specifically, 
we performed a t-test between the outcomes of the two models 
on all training data, using the best configuration settings 
shown in Table 4. The results reported in Table 5 show that 
the p-value is practically zero in all cases, so we reject 
the null hypothesis that the two models behave the same: 
their outcomes differ significantly. 
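
As an illustration of such a significance test, the sketch below computes a paired t statistic on the per-sample outcomes of two models. The paired variant, the function name paired_t_test and the normal approximation of the two-sided p-value (reasonable only for large samples) are assumptions made for this example and not a description of the exact procedure behind Table 5.

```python
import math
from typing import List, Tuple


def paired_t_test(a: List[float], b: List[float]) -> Tuple[float, float]:
    """Paired t statistic and an approximate two-sided p-value.

    a, b: outcomes of the two models on the same samples.
    The p-value uses the standard normal distribution instead of the
    t distribution, which is only appropriate when there are many samples.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with >= 2 samples")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        raise ValueError("differences have zero variance")
    t_stat = mean / math.sqrt(var / n)
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return t_stat, p_value
```

A p-value close to zero indicates that the mean difference between the outcomes of the two models is unlikely to be due to chance.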


6. CONCLUSION AND FUTURE WORK 


In this paper we propose a novel two-part neural network 
architecture for predicting student performance in the next 
exam or exercise based on their performance in previous ex- 
ercises. The first part of the model is a dynamic network 
which tracks the student knowledge state and the second 
part is a multi-layer neural network classifier. For the dy- 
namic part we tested two different models: a potentially 
infinite memory recurrent Bidirectional GRU model and a 
finite memory Time-Delay neural network (TDNN). The 
experimental process showed that the Bi-GRU model achieves 
better performance compared to the TDNN model. Although 
TDNN models have not been used for this problem in the 
past, our results show that they can perform as well as, or 
even better than, previous state-of-the-art RNN models and 
only slightly worse than our proposed RNN model. The model 
inputs are the student’s skill and response histories, which 
are encoded using embedding vectors. Skill embeddings are 
initialized either 
randomly or by pretrained vectors representing the textual 
descriptions of the skills. A novel feature of our architec- 
ture is the addition of spatial dropout and convolutional 
layers immediately after the embeddings layers. These ad- 
ditions have been shown to reduce the overfitting problem. 
We found that the choice of initialization of the skill embed- 
dings has little effect on the outcome of our experiments. 
Moreover, noting that the same datasets are used differently 
in different studies, we described in detail the 
pre-processing of the datasets, and we provide the train, 
validation and test splits of the data that were used in our 
experiments in our GitHub repository. Extensive 
experimentation with more benchmark datasets, as well as the 
study of variants of the proposed models, will be the subject 
of our future work, with the aim of further improving 
the prediction performance of the models. 